AITopics | nyi efficiency

Collaborating Authors

nyi efficiency

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Schmidt, Craig W., Reddy, Varshini, Tanner, Chris, Pinter, Yuval

arXiv.org Artificial IntelligenceMar-31-2025

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with an approximate 20% increase in bytes per token.

machine learning, natural language, pretoken, (16 more...)

arXiv.org Artificial Intelligence

2504.00178

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > Mexico > Mexico City > Mexico City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(12 more...)

Genre:

Workflow (0.66)
Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

Add feedback

Two Counterexamples to Tokenization and the Noiseless Channel

Cognetta, Marco, Zouhar, Vilém, Moon, Sangwhan, Okazaki, Naoaki

arXiv.org Artificial IntelligenceFeb-29-2024

In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), R\'enyi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest R\'enyi efficiency of the unigram distribution should be chosen. The R\'enyi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that R\'enyi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase R\'enyi efficiency while decreasing the downstream model performance. These counterexamples expose cases where R\'enyi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.

efficiency, nyi efficiency, tokenizer, (15 more...)

arXiv.org Artificial Intelligence

2402.14614

Country:

North America > Canada > Ontario > Toronto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(3 more...)

Genre:

Research Report (0.64)
Summary/Review (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Add feedback

Tokenization and the Noiseless Channel

Zouhar, Vilém, Meister, Clara, Gastaldi, Juan Luis, Du, Li, Sachan, Mrinmaya, Cotterell, Ryan

arXiv.org Artificial IntelligenceJun-29-2023

Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of R\'enyi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.16842

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.90)

Add feedback